
Consider a node with a failed reusable replica as still used #2650

Merged

Conversation

@ejweber (Collaborator) commented Feb 26, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#8043

What this PR does / why we need it:

When replicaNodeSoftAntiAffinity == false, the scheduler should not schedule a second replica to a node that already has a failed replica of the same volume. When replicaNodeSoftAntiAffinity == true, it may.

When replicaNodeSoftAntiAffinity == false and there is already a failed replica for a volume scheduled to a node, the scheduler should ONLY consider that node for scheduling if:

  • The failed replica is no longer usable (e.g. spec.rebuildRetryCount >= 5), or
  • replica-replenishment-wait-interval is exceeded.

@ejweber (Collaborator, Author) commented Feb 27, 2024

To test:

Follow the "Observe the root cause" steps from longhorn/longhorn#8043 (comment). I don't recommend trying to follow the "Cause a lockup" steps, because the reproducibility is low.

  1. Watch the replicas while the node reboots. At no point is any other replica scheduled to the node. The end result is a volume with one running replica and two unscheduled replicas. The two unscheduled replicas are different from the ones we started with due to various replica cleanup and replenishment behaviors.
eweber@laptop:~/longhorn> kl get replica -w --output-watch-events
EVENT      NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   18m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           5m48s
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           5m47s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            error     eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b                                            18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            error     eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b                                            18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
DELETED    pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
DELETED    pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   18m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   19m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1            stopped                                                                                                                                                                           0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1            stopped                                                                                                                                                                           0s
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   22m
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1            stopped                                                                                                                                                                           3m43s
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1            stopped                                                                                                                                                                           3m43s

@ejweber ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch 2 times, most recently from cf71863 to b8d0efc Compare February 27, 2024 23:45
@ejweber (Collaborator, Author) commented Feb 28, 2024

This PR is currently problematic because it fails e2e tests like test_single_replica_failed_during_engine_start. That test sets replica-replenishment-wait-interval = 0 and assumes a new replica can be immediately spun up to replace the failed one. This PR does not allow the new replica to be scheduled to the existing node with a failed replica. (Everything unblocks after the failed replica attempts to rebuild five times and is deleted, but given how the docs for replica-replenishment-wait-interval are worded, we should NOT wait that long.)

@ejweber ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch 3 times, most recently from eaf176c to bb06506 Compare March 1, 2024 22:12
@ejweber ejweber changed the title Consider a node with a failed replica as still used Consider a node with a failed reusable replica as still used Mar 4, 2024
@ejweber (Collaborator, Author) commented Mar 4, 2024

After discussing it with @PhanLe1010 and @james-munson, we decided the node should be considered used only if the failed replica is potentially reusable and replica-replenishment-wait-interval has not been exceeded. This avoids breaking existing expected behavior (we will schedule another replica immediately after replica-replenishment-wait-interval, even if it is to a node containing a failed replica). With the changes, this PR is still effective in the original test case. However, the changes may limit the usefulness of this PR in some situations, as it is likely possible to manufacture a scenario in which we can still hit the issue. For example:

  • The volume has two replicas (instead of one).
  • The node that restarts is NOT the node running the engine. The volume becomes degraded instead of faulted.
  • The node takes longer than replica-replenishment-wait-interval to come back.
  • When it comes back, Longhorn schedules the third replica to the node, even though it already contains the second one.

I would prefer to avoid the above potential scenario, but it should not result in a lockup like the original test case. I think it is better to maintain the existing behavior of scheduling replicas to nodes with existing failed replicas once the wait interval expires, and to accept this best-effort fix.

@shuo-wu (Contributor) previously approved these changes Mar 5, 2024 and left a comment


I didn't check the test part, but the implementation LGTM.

@PhanLe1010 (Contributor) commented Mar 6, 2024

The general idea LGTM. Sorry, I cannot review this PR in detail due to time pressure from other tasks. I will defer to @shuo-wu and @james-munson to drive the review.

@james-munson (Contributor) left a comment


Generally LGTM. Just a couple of small questions, but overall this is both clearer and more capable.

scheduler/replica_scheduler.go — 3 review threads (resolved)
Only do this for the purposes of scheduling new replicas. Maintain
previous behavior when checking for reusable replicas.

Longhorn 8043

Signed-off-by: Eric Weber <eric.weber@suse.com>
Consider a node with a failed replica as used if the failed replica is
potentially reusable and replica-replenishment-wait-interval hasn't expired.

Longhorn 8043

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 8043

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch from e282109 to f371f2a Compare March 6, 2024 22:22
@shuo-wu shuo-wu merged commit e685946 into longhorn:master Mar 7, 2024
5 checks passed
@ejweber (Collaborator, Author) commented Mar 7, 2024

@mergify backport v1.6.x


mergify bot commented Mar 7, 2024

backport v1.6.x

✅ Backports have been created

@ejweber (Collaborator, Author) commented Mar 7, 2024

@mergify backport v1.5.x


mergify bot commented Mar 7, 2024

backport v1.5.x

✅ Backports have been created
